Aim

This project builds machine learning models that predict whether breast cancer is benign or malignant, using features computed from digitised fine needle aspirate (FNA) images, and selects the best-performing model for this dataset.

Dataset: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/data

UCI: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

Import libraries and load data

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from sklearn.feature_selection import SelectKBest, chi2

from sklearn.model_selection import (train_test_split, StratifiedShuffleSplit,
                                     KFold, StratifiedKFold, GridSearchCV,
                                     cross_val_score)
from sklearn.preprocessing import LabelEncoder, StandardScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             roc_curve, auc, mean_squared_error)
In [7]:
breast_data = pd.read_csv('data.csv')

breast_data.head()
Out[7]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 NaN
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 NaN

5 rows × 33 columns

In [8]:
breast_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

Data cleaning

In [9]:
breast_data.drop(columns=['id', 'Unnamed: 32'], inplace=True)
In [10]:
missing_values = breast_data.isnull().sum()
print("Missing Values:\n", missing_values)
Missing Values:
 diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
In [11]:
duplicate_rows = breast_data.duplicated()
print("Number of duplicate rows:", duplicate_rows.sum())
Number of duplicate rows: 0
In [12]:
# Summary Statistics for Features
feature_summary = breast_data.drop('diagnosis', axis=1).describe()
feature_summary
Out[12]:
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 30 columns

Data visualization -

Diagnosis distribution: benign vs malignant

In [13]:
diagnosis_distribution = breast_data['diagnosis'].value_counts().reset_index()
diagnosis_distribution.columns = ['Diagnosis', 'Count']

# Assigning colors to each diagnosis category
colors = {'M': 'darkred', 'B': 'steelblue'}

fig = px.bar(diagnosis_distribution, x='Diagnosis', y='Count', color='Diagnosis',
             color_discrete_map=colors, title='Distribution of Malignant (M) and Benign (B) Diagnoses',
             labels={'Diagnosis': 'Diagnosis', 'Count': 'Count'})

# Customize the layout
fig.update_layout(showlegend=False)  # Hide legend for better aesthetics
fig.update_traces(marker_line_width=0)  # Remove border around bars for a cleaner look

fig.show()

Heatmap

In [186]:
relationship = breast_data.columns
plt.figure(figsize=(20, 15))
sns.heatmap(breast_data[relationship[1:]].corr(), annot=True, fmt=".2f")
plt.show()

Distribution of features

In [187]:
num_features = breast_data.drop('diagnosis', axis=1)

plt.figure(figsize=(16, 10))
for i, feature in enumerate(num_features.columns, 1):
    plt.subplot(5, 6, i)
    sns.histplot(num_features[feature], kde=True)
    plt.title(f'Distribution of {feature}')

plt.tight_layout()
plt.show()
In [14]:
selected_features = ['concave points_worst', 'perimeter_worst', 'concave points_mean', 'radius_worst',
                      'perimeter_mean', 'area_worst', 'radius_mean', 'area_mean', 'concavity_mean', 'compactness_mean']
selected_features.append('diagnosis')
selected_breast_data = breast_data[selected_features]
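The ten features above were picked by hand. SelectKBest with the chi-squared test (both imported earlier) can produce a comparable ranking automatically — a sketch using the sklearn copy of this dataset, whose column names differ slightly from the CSV (e.g. 'worst area' rather than 'area_worst'):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

# sklearn's copy of the same Wisconsin diagnostic dataset
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# chi2 requires non-negative features, which holds for these measurements
selector = SelectKBest(score_func=chi2, k=10)
selector.fit(X, y)

# Boolean mask over the 30 columns -> names of the 10 highest-scoring features
top10 = X.columns[selector.get_support()].tolist()
print(top10)
```

The chi-squared scores are scale-sensitive, so area- and perimeter-type features dominate the ranking, much like the hand-picked list above.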

Train-test set splitting

By splitting the data into separate training and test sets, the model's ability to predict accurately on new, unseen cases can be assessed.

In [16]:
from sklearn.model_selection import train_test_split

# splitting data
X_train, X_test, y_train, y_test = train_test_split(
                breast_data.drop('diagnosis', axis=1),
                breast_data['diagnosis'],
                test_size=0.2,
                random_state=42)

print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape)
Shape of training set: (455, 30)
Shape of test set: (114, 30)
  • The training set (X_train and y_train) is used to train the machine learning model to learn the patterns and relationships within the data.

  • The test set (X_test and y_test) is then used to evaluate the model's performance by making predictions on unseen data. The difference between the predicted labels and the actual labels in the test set indicates how well the model generalizes to new, unseen instances.
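With roughly 63% benign and 37% malignant cases, passing stratify to train_test_split keeps that ratio the same in both subsets — a sketch on the sklearn copy of the dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Class proportions now match to within one sample
print(y_tr.mean(), y_te.mean())
```

Without stratification an unlucky split can leave one class under-represented in the test set, which skews the accuracy estimate.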

Data scaling

Scaling helps the model behave consistently, often improves performance, and supports generalisation to new data.

In [17]:
# scaling data
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)  # reuse the training-set statistics; do not re-fit on test data

StandardScaler standardizes each feature by subtracting its mean and scaling to unit variance.
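That is, each value becomes z = (x − μ) / σ, so every scaled column ends up with mean ≈ 0 and standard deviation ≈ 1 — a quick check on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

Z = StandardScaler().fit_transform(X)

# Each column is now centred at 0 with unit variance
print(Z.mean(axis=0))
print(Z.std(axis=0))
```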

KNeighbors Classifier -

The n_neighbors argument sets how many neighbours are considered. KNN is flexible and non-parametric and can capture intricate, non-linear decision boundaries, but its performance depends on a sensible distance metric and proper feature scaling.

In [191]:
# to find which value shows the lowest mean error
error_rate = []

for i in range(1,42):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
In [192]:
plt.figure(figsize=(12,6))
plt.plot(range(1,42), error_rate, color='purple', linestyle="--",
         marker='o', markersize=10, markerfacecolor='b')
plt.title('Error_Rate vs K-value')
plt.show()
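The elbow plot chooses k by eye from a single test split; GridSearchCV (imported earlier) can run the same search with cross-validation instead — a sketch on the sklearn copy of the dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr = StandardScaler().fit_transform(X_tr)

# Search the same k range as the elbow plot, scored by 5-fold CV accuracy
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': range(1, 42)},
                    cv=5, scoring='accuracy')
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.best_score_, 3))
```

Because each candidate k is scored on five folds rather than one held-out set, the chosen value is less sensitive to a lucky or unlucky split.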
In [193]:
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
prediction1 = knn.predict(X_test)
In [194]:
print(confusion_matrix(y_test, prediction1))
print("\n")
print(classification_report(y_test, prediction1))
[[70  1]
 [ 4 39]]


              precision    recall  f1-score   support

           B       0.95      0.99      0.97        71
           M       0.97      0.91      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

The classification report shows the model's strengths and weaknesses in predicting benign and malignant cases.

In [195]:
knn_model_acc = accuracy_score(y_test, prediction1)
print("Accuracy of K Neighbors Classifier Model is: ", knn_model_acc)
Accuracy of K Neighbors Classifier Model is:  0.956140350877193
In [86]:
from sklearn.model_selection import cross_val_score

KNeighborsClassifier_cross_val = cross_val_score(KNeighborsClassifier(),X_train,y_train)
print("Cross validation score of KNeighborsClassifier Model:")

count = 0
for i in KNeighborsClassifier_cross_val:
    count+=1
    print(f"{count}) {round(i*100, ndigits = 2)} %")
Cross validation score of KNeighborsClassifier Model:
1) 96.7 %
2) 95.6 %
3) 98.9 %
4) 96.7 %
5) 92.31 %

Cross-validation gives a more reliable estimate of the model's performance and ensures it is not overfitting to a particular train-test split. This is critical for healthcare applications, where the model's reliability and ability to generalise to new, unseen patient data are essential for accurate predictions.
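Rather than listing the fold scores one by one, the mean and standard deviation summarize the estimate; wrapping the scaler and classifier in a pipeline also re-fits the scaler inside each fold, avoiding leakage — a sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline fits the scaler on each fold's training portion only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
scores = cross_val_score(model, X, y, cv=5)

print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```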

Linear Regression model from scratch -

Here linear regression is applied to breast cancer prediction: it fits a linear relationship between the measured cell features and a numeric (0/1) encoding of the diagnosis.

In [197]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load breast cancer dataset
data = load_breast_cancer()
X = data.data[:, 0].reshape(-1, 1)  # Using only one feature for simplicity
y = data.target.reshape(-1, 1)

# Linear Regression Implementation (batch gradient descent)
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None  # weights[0] is the intercept (bias)

    def fit(self, X, y):
        # Add a bias (intercept) column of ones to the input
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]

        # Initialize weights; the intercept is weights[0], so no separate bias term is needed
        self.weights = np.random.randn(X_bias.shape[1], 1)

        for _ in range(self.n_iterations):
            # Compute predictions
            predictions = X_bias @ self.weights

            # Gradient of the mean squared error (intercept included)
            dw = (1 / X_bias.shape[0]) * X_bias.T @ (predictions - y)

            # Update weights
            self.weights -= self.learning_rate * dw

    def predict(self, X):
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]
        return X_bias @ self.weights

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)

# Make predictions
predictions = linear_reg_model.predict(X_test)

# Plot the data and the linear regression line
plt.scatter(X_test[:, 0], y_test[:, 0], label='Data')
plt.plot(X_test[:, 0], predictions[:, 0], label='Linear Regression', color='red')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
In [198]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LinearRegression

# Load breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Convert target to binary labels (0 or 1)
y_binary = (y > 0).astype(int)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Add a bias term to the input for linear regression
X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_test_bias = np.c_[np.ones((X_test.shape[0], 1)), X_test]

# Create and train the Linear Regression model
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train_bias, y_train)

# Make predictions
predictions = linear_reg_model.predict(X_test_bias)

# Convert predictions to binary labels (0 or 1)
predictions_binary = (predictions > 0.5).astype(int)

# Display classification report
print("Classification Report:\n", classification_report(y_test, predictions_binary))
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.88      0.92        43
           1       0.93      0.97      0.95        71

    accuracy                           0.94       114
   macro avg       0.94      0.93      0.93       114
weighted avg       0.94      0.94      0.94       114

In [199]:
# Linear Regression Implementation
class LinearRegressionModel:
    def __init__(self):
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        # Add a bias term to the input
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]

        # Compute the weights using the normal equation
        theta = np.linalg.inv(X_bias.T @ X_bias) @ X_bias.T @ y

        # Extract weights and bias
        self.bias = theta[0]
        self.weights = theta[1:]

    def predict(self, X):
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]
        return X_bias @ np.concatenate(([self.bias], self.weights))

# Create and train the Linear Regression model
linear_reg_model = LinearRegressionModel()
linear_reg_model.fit(X_train, y_train)

# Make predictions
predictions = linear_reg_model.predict(X_test)

# Calculate Mean Squared Error for evaluation
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

# Perform k-fold cross-validation
def k_fold_cross_validation(model, X, y, k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    mse_scores = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model.fit(X_train, y_train)
        predictions = model.predict(X_test)

        mse = mean_squared_error(y_test, predictions)
        mse_scores.append(mse)

    return np.mean(mse_scores)

# Use k-fold cross-validation with the Linear Regression model
mse_cv = k_fold_cross_validation(linear_reg_model, X, y)
print(f'Mean Squared Error (Cross-Validation): {mse_cv}')
Mean Squared Error: 0.06410886246958959
Mean Squared Error (Cross-Validation): 0.0626649678814163

Naive Bayes -

To predict whether a tumour is likely to be benign or malignant, the Naive Bayes method combines the probability of each feature given the class label with the class prior.
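Concretely, Gaussian Naive Bayes models each feature within each class as a normal distribution and multiplies the likelihood by the class prior. A minimal one-feature sketch of the posterior, with illustrative (not fitted) means, standard deviations, and priors:

```python
from scipy.stats import norm

# Hypothetical 1-D example: radius_mean modelled per class.
# These means/stds/priors are illustrative, not fitted to the dataset.
mu = {'B': 12.1, 'M': 17.5}
sigma = {'B': 1.8, 'M': 3.2}
prior = {'B': 0.63, 'M': 0.37}

x = 16.0  # observed radius_mean

# Unnormalized posterior: class-conditional Gaussian likelihood * prior
post = {c: norm.pdf(x, mu[c], sigma[c]) * prior[c] for c in ('B', 'M')}

# Normalize so the two posteriors sum to 1
total = sum(post.values())
post = {c: p / total for c, p in post.items()}
print(post)
```

With several features, the "naive" assumption is that their class-conditional likelihoods simply multiply together.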

In [54]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()        # build the model
nb.fit(X_train, y_train) # train the model
print("Accuracy of the Naive Bayes model: {}".format(nb.score(X_test, y_test)))
nb_acc_score = nb.score(X_test, y_test)
Accuracy of the Naive Bayes model: 0.956140350877193
In [56]:
y_pred = nb.predict(X_test)
y_true = y_test

cm = confusion_matrix(y_true, y_pred)

#visualize
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm,annot = True, linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()
In [58]:
print(confusion_matrix(y_test,y_pred))
print("\n")
print(classification_report(y_test, y_pred))
[[70  1]
 [ 4 39]]


              precision    recall  f1-score   support

           B       0.95      0.99      0.97        71
           M       0.97      0.91      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

In [88]:
from sklearn.model_selection import cross_val_score

GaussianNB_cross_val = cross_val_score(GaussianNB(),X_train,y_train)
print("Cross validation score of Bayesian classification Model:")

count = 0
for i in GaussianNB_cross_val:
    count+=1
    print(f"{count}) {round(i*100, ndigits = 2)} %")
Cross validation score of Bayesian classification Model:
1) 90.11 %
2) 96.7 %
3) 93.41 %
4) 93.41 %
5) 93.41 %

Logistic Regression -

Logistic regression and other classification models are widely used in medicine for tasks like cancer prediction because they produce distinct binary outcomes along with class probabilities.
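Under the hood, logistic regression maps a linear score z through the sigmoid σ(z) = 1 / (1 + e^(−z)) to a probability, which predict_proba exposes; the default 0.5 threshold can then be moved to trade precision against recall — a sketch on the sklearn copy of the dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)

# Sigmoid output: estimated probability of class 1 for each test case
proba = clf.predict_proba(scaler.transform(X_te))[:, 1]

# Lowering the threshold below 0.5 flags more cases as the positive class
preds_default = (proba >= 0.5).astype(int)
preds_cautious = (proba >= 0.3).astype(int)
print(preds_default.sum(), preds_cautious.sum())
```

In a screening setting a lower threshold trades some false positives for fewer missed malignancies.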

In [63]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predictions1 = logreg.predict(X_test)
In [64]:
print("Confusion Matrix: \n", confusion_matrix(y_test, predictions1))
print('\n')
print(classification_report(y_test, predictions1))
Confusion Matrix: 
 [[71  0]
 [ 2 41]]


              precision    recall  f1-score   support

           B       0.97      1.00      0.99        71
           M       1.00      0.95      0.98        43

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

In [65]:
logreg_acc = accuracy_score(y_test, predictions1)
print("Accuracy of the Logistic Regression Model is: ", logreg_acc)
Accuracy of the Logistic Regression Model is:  0.9824561403508771
In [89]:
from sklearn.model_selection import cross_val_score

LogisticRegression_cross_val = cross_val_score(LogisticRegression(),X_train,y_train)
print("Cross validation score of Logistic Regression Model:")

count = 0
for i in LogisticRegression_cross_val:
    count+=1
    print(f"{count}) {round(i*100, ndigits = 2)} %")
Cross validation score of Logistic Regression Model:
1) 97.8 %
2) 96.7 %
3) 100.0 %
4) 97.8 %
5) 94.51 %

Conclusion -

The best ML model for this dataset is determined by comparing each model's accuracy:

Accuracy of the linear regression model: 94%
Accuracy of the Naive Bayes model: 96%
Accuracy of the K Neighbors Classifier model: 96%
Accuracy of the Logistic Regression model: 98%

Hence, the best model for this dataset is the logistic regression model.

This model will help to predict whether the breast cancer diagnosis is benign or malignant